Overview

Subquestions:

1. What is the most frequent word used in popular songs?

4. Is there a pattern among successful songs?

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

Data Collection and Data Cleaning

Music Data

This project collected music information and Twitter posts from different APIs.

billboard.py, a Python API for accessing music charts from Billboard.com, was used to collect the song titles and artists’ names. With that information, PyLyrics, a Python module that retrieves song lyrics from lyrics.wikia.com, was used to fetch the lyrics.

The biggest data-cleaning problem in this part was the special characters in the titles and the artist lists:

Graph 1(Tableau)
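One way to normalize titles and artist credits before looking up lyrics is sketched below. The exact rules used in this project are not listed here, so the choices in this sketch (dropping bracketed suffixes, stripping leftover symbols, and keeping only the first credited artist) are assumptions for illustration.

```python
import re

# Separators that often join several artists in one chart credit (assumed list).
_ARTIST_SEPARATORS = re.compile(
    r"\s+(?:featuring|feat\.?|ft\.?|&|x)\s+", re.IGNORECASE
)


def clean_title(title: str) -> str:
    """Drop bracketed suffixes and stray symbols from a song title."""
    title = re.sub(r"\s*[\(\[].*?[\)\]]", "", title)  # remove "(Remix)", "[Live]"
    title = re.sub(r"[^\w\s'&-]", "", title)          # remove other special signs
    return re.sub(r"\s+", " ", title).strip()


def primary_artist(credit: str) -> str:
    """Return the first artist named in a chart credit."""
    return _ARTIST_SEPARATORS.split(credit, maxsplit=1)[0].strip()
```

For example, clean_title("Despacito (Remix)") gives "Despacito". Band names that themselves contain a separator such as "&" would need an exception list.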

In addition, the dataset contains some songs in languages other than English:

Graph 2(matplotlib), Source Code
Graph 3(matplotlib), Source Code

We can see that very few songs use languages other than English. These songs were removed so that the sentiment analysis would be more accurate.

Comparison of the data before and after cleaning

Graph 4(Tableau)

Twitter Data

Twitter data cleaning

When cleaning the Twitter data, we first checked whether the same Twitter post link appeared more than once in our dataset; if it did, we removed the duplicate records. We then removed non-alphabetic symbols and hyperlinks from the text. The plot below provides statistics on the records we cleaned.

Graph 6(matplotlib), Source Code
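The two cleaning steps described above can be sketched as follows. The record layout (dicts with "link" and "text" keys) is hypothetical, and keeping the first record for each link is an assumption about how duplicates were handled.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_PATTERN = re.compile(r"[^A-Za-z\s]")


def clean_text(text: str) -> str:
    """Remove hyperlinks, then non-alphabetic symbols, then extra spaces."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ALPHA_PATTERN.sub(" ", text)
    return " ".join(text.split())


def clean_tweets(records):
    """Keep the first record per post link and clean its text.

    `records` is an iterable of dicts with the hypothetical keys
    "link" and "text".
    """
    seen_links = set()
    cleaned = []
    for record in records:
        if record["link"] in seen_links:
            continue  # duplicate post link: skip this record
        seen_links.add(record["link"])
        cleaned.append({**record, "text": clean_text(record["text"])})
    return cleaned
```

Hyperlinks are removed before the symbol filter so that URL fragments do not survive as stray letters.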

Twitter sentiment analysis

To generate a sentiment score for each tweet, we built our own sentiment analysis tool using Word2Vec and deep learning. We first obtained a training dataset from Sentiment140, which contains 1.4 million tweets with labelled sentiment scores. We then tokenized both the training data and our own Twitter data with the NLTK tokenizer and vectorized each token using a gensim Word2Vec model with vector size 512 and window size 10. Finally, we constructed a 7-layer convolutional neural network and trained it on the training data.

With 10-fold cross-validation, the model reached 75% accuracy. We then applied it to predict sentiment scores for the collected Twitter data, which yields the frequency distribution below:

Graph 7(matplotlib), Source Code
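The 10-fold cross-validation used to estimate accuracy can be sketched as below. This is a generic illustration, not the project's training code: `fit_and_score` is a hypothetical callback that would train the model on the training indices and return its accuracy on the test indices.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


def cross_validate(n_samples, fit_and_score, k=10):
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [
        fit_and_score(train, test)
        for train, test in k_fold_indices(n_samples, k)
    ]
    return sum(scores) / len(scores)
```

Each sample appears in exactly one test fold, so the averaged score uses every labelled tweet once for evaluation.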

1. What are the most frequent words used in popular songs?

Let’s start looking for an answer with some EDA plots. In the first sub-question, our team simply wants to study which words are used most frequently in the lyrics. We first use a word cloud of all the songs in the Billboard Top 100.

Graph 8(matplotlib, Word Cloud), Source Code
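The counts behind a word cloud like this can be computed with a simple frequency table. The sketch below assumes the lyrics are available as plain strings; the small stop-word list is illustrative only and is not the one used for the plot.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "i", "you", "to", "it", "in", "of", "me", "my"}


def most_frequent_words(lyrics_list, top_n=10):
    """Count words across all lyrics and return the top_n most common."""
    counts = Counter()
    for lyrics in lyrics_list:
        words = re.findall(r"[a-z']+", lyrics.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)
```

The returned list of (word, count) pairs can be turned into a dict and passed to a word cloud generator or plotted as a bar chart.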

Taylor Swift Example

Graph 18(ggplot), Source Code

The average word count per track is close to 375, and the chart shows that most songs fall between 345 and 400 words. The density plot shows that the distribution is close to normal.

Graph 19(ggplot), Source Code

The album-wise word count shows that the latest album tends to have more words.

Graph 20(ggplot), Source Code

This plot shows a similar result. This suggests that, to be a superstar, you need to convey more information in your songs, through which you can influence your audience.

Graph 21(ggplot), Source Code

Overall, the most frequent mood in the songs is positive. We can also see that Taylor Swift has expressed all kinds of emotions in her songs; joy, anticipation, and trust emerge as the top three.

Graph 22(ggplot), Source Code

Now that we have the overall sentiment scores, we should find the top words that contribute to each emotion and to positive/negative sentiment. The visualization shows that the word “bad” is predominant in emotions such as anger, disgust, sadness, and fear, while surprise and trust are driven by the word “good”.
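Finding the top contributing words per emotion amounts to joining the lyric words against an emotion lexicon and counting. The sketch below uses a hypothetical two-word lexicon that only mirrors the words mentioned above; it is not the lexicon used for the plot.

```python
from collections import Counter, defaultdict

# Hypothetical mini-lexicon mapping a word to the emotions it signals.
# It only mirrors the two words discussed above and is not a real lexicon.
EMOTION_LEXICON = {
    "bad": {"anger", "disgust", "sadness", "fear"},
    "good": {"surprise", "trust"},
}


def top_words_by_emotion(words, lexicon=EMOTION_LEXICON, top_n=3):
    """Return, per emotion, the top_n words that contribute most to it."""
    counts = defaultdict(Counter)
    for word in words:
        for emotion in lexicon.get(word, ()):
            counts[emotion][word] += 1
    return {emotion: c.most_common(top_n) for emotion, c in counts.items()}
```

A word linked to several emotions is counted once for each of them, which is why a single frequent word can dominate more than one panel.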

Graph 23(ggplot), Source Code

We can see that joy has its largest share in 2010 and 2014. Overall, surprise, disgust, and anger are the emotions with the lowest scores; however, compared with other years, 2017 has the largest contribution from disgust. As for anticipation, 2010 and 2012 have higher contributions than other years.

We also drew some visualizations with seaborn to see the popularity of the songs and how it changed over the years, or even months.

Graph 24(Seaborn), Source Code

The graph above shows the density of songs over relative time (x-axis) against the number of weeks they stayed on the chart. It is easy to tell that the graph has two high-density regions, both in the middle of the period. Around 2014, many great songs were created. Around 2012, many songs were also created, but they did not stay on the chart as long as the songs created in 2014.

Graph 25(Seaborn), Source Code

The graph above shows the number of favorites and retweets for the songs over ten years. There is a huge difference between the periods before and after 2017: after 2017, the number of favorites and retweets increased dramatically. This suggests either that the songs on the chart became more influential after 2017 or that people became much more active on Twitter.

Graph 26(Seaborn), Source Code

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

The fifth sub-question is whether there is a relationship between the sentiment analysis results from tweets and lyrics. In other words, the question becomes “Do the lyrics’ attitudes influence people’s comments about them on Twitter?” and “Does the connection between the two attitudes influence the popularity of the songs?”

To answer the questions above, we have to look at the sentiment analysis results from tweets and lyrics.

Graph 28(Seaborn), Source Code

The graph above shows a two-dimensional plot of the sentiment attitude scores. There seems to be no linear relationship between the tweets’ attitude and the lyrics’ attitude. However, we cannot be sure until we run some tests. Still, the graph gives us one interesting piece of information: most points are located at the top and bottom of the graph, which means most songs are either very negative or very positive. Those songs also receive more attention on Twitter than neutral songs.

Next, we looked at each individual analysis result.

Graph 29(Seaborn), Source Code

As the boxplot above shows, lyrics have much higher attitude points than tweets. Most tweets’ attitude points are below the neutral level, which means they are more negative, while most song lyrics tend to be positive. However, we still cannot be sure that the difference is significant, so we run a one-way ANOVA test to check.

Hypothesis test: One-Way ANOVA

Hypothesis1, Source Code

As the result above shows, the p-value is below 0.05, so we reject the null hypothesis. This means there is a significant difference between the lyrics’ attitude points and the tweets’ attitude points. Knowing that they differ, we move on to the next part and test whether there is a relationship between the lyrics’ attitude and the tweets’ attitude.

Hypothesis test: Linear regression
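For reference, the F statistic behind a one-way ANOVA can be computed by hand as the ratio of between-group to within-group mean squares. The sketch below is a plain-Python illustration, not the project's source code; the p-value is the upper tail of the F distribution with the returned degrees of freedom, which a statistics library such as SciPy (scipy.stats.f_oneway) reports directly.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for two or more groups of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of each score around its own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Here the two groups would be the lyrics’ attitude points and the tweets’ attitude points; a large F relative to the F distribution gives a small p-value.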

Hypothesis2, Source Code

As the results above show, we used the number of weeks that songs stayed on the chart as an additional term that may relate to the attitude points. All p-values are much greater than 0.05, so we fail to reject the null hypothesis: there is no significant linear relationship between the tweets’ attitude points and either the lyrics’ attitude points or the number of weeks that songs stayed on the chart (which can also be read as popularity).
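The regression above has the tweets’ attitude points as the response and two predictors. A minimal plain-Python sketch of the least-squares fit is given below, assuming each row holds (lyrics attitude, weeks on chart); the function name and data layout are hypothetical. It returns only the coefficients. The p-values discussed above additionally need standard errors and a t distribution, which a statistics package such as statsmodels reports.

```python
def ols_fit(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]
```

Coefficients near zero for both predictors would be consistent with the conclusion above, although significance can only be judged from the p-values. Perfectly collinear predictors make the system singular and raise a ZeroDivisionError in this sketch.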